Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato.
translated by 谷歌翻译
密集对象跟踪,能够通过像素级精度本地化特定的对象点,是一个重要的计算机视觉任务,具有多种机器人的下游应用程序。现有方法在单个前向通行证中计算密集的键盘嵌入,这意味着模型培训以一次性跟踪所有内容,或者将它们的全部容量分配给稀疏预定义的点,交易一般性以获得准确性。在本文中,我们基于观察到给定时间的相关点数通常相对较少,例如,探索中间地面。掌握目标对象的点。我们的主要贡献是一种新颖的架构,灵感来自少量任务适应,这允许一个稀疏样式的网络在嵌入点嵌入的关键点嵌入时的条件。我们的中央发现是,这种方法提供了密集嵌入模型的一般性,同时提供准确性更加接近稀疏关键点方法。我们呈现了说明此容量与准确性权衡的结果,并使用真正的机器人挑选任务展示将转移到新对象实例(在课程中)的能力。
translated by 谷歌翻译
Deep neural networks (DNNs) are found to be vulnerable to adversarial attacks, and various methods have been proposed for the defense. Among these methods, adversarial training has been drawing increasing attention because of its simplicity and effectiveness. However, the performance of the adversarial training is greatly limited by the architectures of target DNNs, which often makes the resulting DNNs with poor accuracy and unsatisfactory robustness. To address this problem, we propose DSARA to automatically search for the neural architectures that are accurate and robust after adversarial training. In particular, we design a novel cell-based search space specially for adversarial training, which improves the accuracy and the robustness upper bound of the searched architectures by carefully designing the placement of the cells and the proportional relationship of the filter numbers. Then we propose a two-stage search strategy to search for both accurate and robust neural architectures. At the first stage, the architecture parameters are optimized to minimize the adversarial loss, which makes full use of the effectiveness of the adversarial training in enhancing the robustness. At the second stage, the architecture parameters are optimized to minimize both the natural loss and the adversarial loss utilizing the proposed multi-objective adversarial training method, so that the searched neural architectures are both accurate and robust. We evaluate the proposed algorithm under natural data and various adversarial attacks, which reveals the superiority of the proposed method in terms of both accurate and robust architectures. We also conclude that accurate and robust neural architectures tend to deploy very different structures near the input and the output, which has great practical significance on both hand-crafting and automatically designing of accurate and robust neural architectures.
translated by 谷歌翻译
Vision-based tactile sensors have gained extensive attention in the robotics community. The sensors are highly expected to be capable of extracting contact information i.e. haptic information during in-hand manipulation. This nature of tactile sensors makes them a perfect match for haptic feedback applications. In this paper, we propose a contact force estimation method using the vision-based tactile sensor DIGIT, and apply it to a position-force teleoperation architecture for force feedback. The force estimation is done by building a depth map for DIGIT gel surface deformation measurement and applying a regression algorithm on estimated depth data and ground truth force data to get the depth-force relationship. The experiment is performed by constructing a grasping force feedback system with a haptic device as a leader robot and a parallel robot gripper as a follower robot, where the DIGIT sensor is attached to the tip of the robot gripper to estimate the contact force. The preliminary results show the capability of using the low-cost vision-based sensor for force feedback applications.
translated by 谷歌翻译
Recently, evolutionary multitasking (EMT) has been successfully used in the field of high-dimensional classification. However, the generation of multiple tasks in the existing EMT-based feature selection (FS) methods is relatively simple, using only the Relief-F method to collect related features with similar importance into one task, which cannot provide more diversified tasks for knowledge transfer. Thus, this paper devises a new EMT algorithm for FS in high-dimensional classification, which first adopts different filtering methods to produce multiple tasks and then modifies a competitive swarm optimizer to efficiently solve these related tasks via knowledge transfer. First, a diversified multiple task generation method is designed based on multiple filtering methods, which generates several relevant low-dimensional FS tasks by eliminating irrelevant features. In this way, useful knowledge for solving simple and relevant tasks can be transferred to simplify and speed up the solution of the original high-dimensional FS task. Then, a competitive swarm optimizer is modified to simultaneously solve these relevant FS tasks by transferring useful knowledge among them. Numerous empirical results demonstrate that the proposed EMT-based FS method can obtain a better feature subset than several state-of-the-art FS methods on eighteen high-dimensional datasets.
translated by 谷歌翻译
Many state-of-the-art natural language understanding (NLU) models are based on pretrained neural language models. These models often make inferences using information from multiple sources. An important class of such inferences are those that require both background knowledge, presumably contained in a model's pretrained parameters, and instance-specific information that is supplied at inference time. However, the integration and reasoning abilities of NLU models in the presence of multiple knowledge sources have been largely understudied. In this work, we propose a test suite of coreference resolution tasks that require reasoning over multiple facts. Our dataset is organized into subtasks that differ in terms of which knowledge sources contain relevant facts. We evaluate state-of-the-art coreference resolution models on our dataset. Our results indicate that several models struggle to reason on-the-fly over knowledge observed both at pretrain time and at inference time. However, with task-specific training, a subset of models demonstrates the ability to integrate certain knowledge types from multiple sources.
translated by 谷歌翻译
Despite recent progress in Natural Language Understanding (NLU), the creation of multilingual NLU systems remains a challenge. It is common to have NLU systems limited to a subset of languages due to lack of available data. They also often vary widely in performance. We launch a three-phase approach to address the limitations in NLU and help propel NLU technology to new heights. We release a 52 language dataset called the Multilingual Amazon SLU resource package (SLURP) for Slot-filling, Intent classification, and Virtual assistant Evaluation, or MASSIVE, in an effort to address parallel data availability for voice assistants. We organize the Massively Multilingual NLU 2022 Challenge to provide a competitive environment and push the state-of-the art in the transferability of models into other languages. Finally, we host the first Massively Multilingual NLU workshop which brings these components together. The MMNLU workshop seeks to advance the science behind multilingual NLU by providing a platform for the presentation of new research in the field and connecting teams working on this research direction. This paper summarizes the dataset, workshop and the competition and the findings of each phase.
translated by 谷歌翻译
We present hierarchical policy blending as optimal transport (HiPBOT). This hierarchical framework adapts the weights of low-level reactive expert policies, adding a look-ahead planning layer on the parameter space of a product of expert policies and agents. Our high-level planner realizes a policy blending via unbalanced optimal transport, consolidating the scaling of underlying Riemannian motion policies, effectively adjusting their Riemannian matrix, and deciding over the priorities between experts and agents, guaranteeing safety and task success. Our experimental results in a range of application scenarios from low-dimensional navigation to high-dimensional whole-body control showcase the efficacy and efficiency of HiPBOT, which outperforms state-of-the-art baselines that either perform probabilistic inference or define a tree structure of experts, paving the way for new applications of optimal transport to robot control. More material at https://sites.google.com/view/hipobot
translated by 谷歌翻译
解释视觉场景的含义不仅需要识别其成分对象,还需要对象相互关系的丰富语义表征。在这里,我们通过将现代计算技术应用于复杂自然场景引起的人类脑反应的大规模7T fMRI数据集,研究视觉语义转换的神经机制。使用通过将语言深度学习模型应用于人类生成的场景描述获得的语义嵌入,我们确定了编码语义场景描述的大脑区域的广泛分布网络。重要的是,这些语义嵌入比传统对象类别标签更好地解释了这些区域的活动。此外,尽管参与者没有积极从事语义任务,但它们还是活动的有效预测指标,这表明Visuo-Semantic转换是默认的视觉方式。为了支持这种观点,我们表明,可以直接通过大脑活动模式直接将场景字幕的高度精确重建。最后,经过语义嵌入训练的经常性卷积神经网络进一步超过了语义嵌入在预测大脑活动时的语义嵌入,从而提供了大脑视觉语义转换的机械模型。这些实验和计算结果在一起表明,将视觉输入转换为丰富的语义场景描述可能是视觉系统的核心目标,并且将重点放在这一新目标上可能会导致改进人类大脑中视觉信息处理的模型。
translated by 谷歌翻译
目的:我们提出了一个正式的框架,用于使用统一的运动原始图(MPS)作为基本手术动作来建模手术任务,以实现不同数据集的更客观的标记和聚集,并培训通用模型,以实现手术动作识别。方法:我们使用我们的框架来创建上下文和运动原始骨料外科手术集(指南针),包括来自三个公共可用数据集(拼图,桌子,桌子和Rosma)的六个干燥LAB手术任务标签。提出了标记手术环境和自动转换为MPS的方法。我们提出了一项任务(Loto)交叉验证方法,以评估模型概括为看不见的任务的能力。结果:我们的上下文标签方法达到了众包的共识标签与专家外科医生之间的几乎完美的一致性。对MPS的任务进行分割,可以生成单独的左右笔录,并显着改善Loto的性能。我们发现,如果对具有相同上下文的任务和/或来自同一数据集的任务进行了培训,则MP细分模型的性能最佳。结论:所提出的框架可以基于上下文和细粒度的MPS对外科数据进行高质量的标记。使用MPS对外科手术任务进行建模可以使不同数据集的汇总用于训练动作识别模型,这些模型可以比在手势级别训练的模型更好地概括地看不见的任务。意义:我们的正式框架和汇总数据集可以支持用于手术过程分析,技能评估,错误检测和自治的模型和算法的开发。
translated by 谷歌翻译